Abstract

CLEAR  is  a  first  of  its  kind  framework  which  overcomes  a  major 
challenge  in  the  design  of  digital  systems  that  are  resilient  to 
reliability  failures:  achieve  desired  resilience  targets  at  minimal  costs 
(energy,  power,  execution  time,  area)  by  combining  resilience 
techniques  across  various  layers  of  the  system  stack  (circuit,  logic, 
architecture,  software,  algorithm).  CLEAR  automatically  and 
systematically  explores  the  large  space  of  techniques  and  their 
combinations  (586  cross-layer  combinations  in  this  paper),  derives 
cost-effective  solutions,  and  provides  guidelines  for  designing  new 
techniques.  Carefully  optimized  combinations  of  circuit-level 
hardening,  logic-level  parity  checking,  and  micro-architectural 
recovery  provide  highly  cost-effective  soft  error  resilience  for 
general-purpose processor cores. A 50x silent data corruption rate
improvement is achieved at 2.1% energy cost for out-of-order (6.1%
for in-order) cores, with no speed impact. Selective circuit-level
hardening  alone,  guided  by  thorough  application  benchmark  analysis, 
also  provides  cost-effective  solutions  (~1%  additional  energy  cost  for 
the  same  50x  improvement). 

1.  Introduction 

This  paper  addresses  the  cross-layer  resilience  challenge  for 
designing  robust  digital  systems:  given  a  set  of  resilience  techniques 
at  various  abstraction  layers  (circuit,  logic,  architecture,  software, 
algorithm),  how  does  one  protect  a  given  design  from  radiation- 
induced  soft  errors  using  (perhaps)  a  combination  of  these 
techniques,  across  multiple  abstraction  layers,  such  that  overall  soft 
error  resilience  targets  are  met  at  minimal  costs  (energy,  power, 
execution  time,  area)?  Specific  soft  error  resilience  targets  addressed 
in this paper are: Silent Data Corruption (SDC), where an error causes
the system to output an incorrect result without error indication; and
Detected but Uncorrected Error (DUE), where an error is detected
(e.g.,  by  a  resilience  technique  or  a  system  crash  or  hang)  but  is  not 
recovered  automatically  without  user  intervention. 

The  need  for  cross-layer  resilience,  where  multiple  error  resilience 
techniques  from  different  layers  of  the  system  stack  cooperate  to 
achieve  cost-effective  error  resilience,  is  articulated  in  several 
publications  (e.g.,  [Carter  10,  Gupta  14,  Pedram  12]).  There  are 
numerous  publications  on  error  resilience  techniques,  many  of  which 
span  multiple  abstraction  layers.  However,  these  publications  mostly 
describe specific implementations (e.g., [Lu 82, Meaney 05, Sinharoy
11]). Cross-layer resilience implementations in commercial systems
are  often  based  on  “designer  experience”  or  “historical  practice.” 
There  exists  no  comprehensive  framework  to  systematically  address 
the  cross-layer  resilience  challenge.  Creating  such  a  framework  is 
difficult.  It  must  encompass  the  entire  design  flow  end-to-end,  from 
comprehensive  and  thorough  analysis  of  various  combinations  of 
error  resilience  techniques  all  the  way  to  layout-level 
implementations,  such  that  one  can  (automatically)  determine  which 
resilience  technique  or  combination  of  techniques  (at  the  same  or 
across  different  abstraction  layers)  should  be  chosen.  Such  a 
framework  is  essential  in  order  to  answer  important  cross-layer 
resilience  questions  such  as: 

1. Is cross-layer resilience the best approach for achieving a given
resilience  target  at  low  cost? 

2.  Are  all  cross-layer  solutions  equally  cost-effective?  If  not,  which 
cross-layer  solutions  are  the  best? 

3.  How  do  cross-layer  choices  change  depending  on  application- 
level  energy,  latency,  and  area  constraints? 

4.  How  can  one  create  a  cross-layer  resilience  solution  that  is  cost- 
effective  across  a  wide  variety  of  application  workloads? 

5.  Are  there  general  guidelines  for  new  error  resilience  techniques 
to  be  cost-effective? 

CLEAR (Cross-Layer Exploration for Architecting Resilience) is a
first-of-its-kind framework that addresses the cross-layer resilience
challenge [Cheng 16a, 16b]. In this paper, we focus on the use of
CLEAR for radiation-induced soft errors in terrestrial environments.

Although  the  soft  error  rate  of  an  SRAM  cell  or  a  flip-flop  stays 
roughly  constant  or  even  decreases  over  technology  generations,  the 
system-level  soft  error  rate  increases  with  increased  integration  [Mitra 
14,  Seifert  12]  and  can  increase  when  lower  supply  voltages  are  used 
to  improve  energy  efficiency  [Mahatme  13,  Pawlowski  14].  We  focus 
on  flip-flop  soft  errors  because  design  techniques  to  protect  them  are 
generally  expensive.  Coding  techniques  are  routinely  used  for 
protecting  on-chip  memories.  Combinational  logic  circuits  are 
significantly  less  susceptible  to  soft  errors  and  do  not  pose  a  concern 
[Gill 09, Seifert 12]. We address both single-event upsets (SEUs) and
single-event multiple upsets (SEMUs) [Lee 10, Pawlowski 14]. While
CLEAR  can  address  soft  errors  in  various  digital  components  of  a 
complex  System-on-a-Chip  (e.g.,  uncore  [Cho  15],  hardware 
accelerators),  this  paper  focuses  on  soft  errors  in  processor  cores. 

To  demonstrate  the  effectiveness  and  practicality  of  CLEAR,  we 
explore  586  cross-layer  combinations  using  ten  representative  error 
detection/correction  techniques  and  four  hardware  error  recovery 
techniques  spanning  various  layers  of  the  system  stack:  circuit,  logic, 
architecture,  software,  and  algorithm  (Fig.  1).  Our  exploration 
encompasses  over  9  million  flip-flop  soft  error  injections  into  two 
diverse processor core architectures: a simple in-order SPARC Leon3
core (InO-core) [Leon] and a complex super-scalar out-of-order
Alpha IVM core (OoO-core) [Wang 04], across 18 benchmarks. Such
extensive  exploration  enables  us  to  conclusively  answer  the  above 
cross-layer  resilience  questions: 

1. For a wide range of error resilience targets, optimized cross-layer
combinations can provide low-cost solutions for soft errors.

2.  Not  all  cross-layer  solutions  are  cost-effective. 

a.  For  general-purpose  processor  cores,  a  carefully  optimized 
combination  of  selective  circuit-level  hardening,  logic-level  parity 
checking,  and  micro-architectural  recovery  provides  a  highly 
effective  cross-layer  resilience  solution. 

b.  When  the  application  space  is  restricted  to  matrix  operations, 
a  combination  of  Algorithm  Based  Fault  Tolerance  (ABFT) 
correction,  selective  circuit-level  hardening,  logic-level  parity 
checking,  and  micro-architectural  recovery  can  be  highly  effective. 

c.  Selective  circuit-level  hardening,  guided  by  a  thorough 
analysis  of  the  effects  of  soft  errors  on  application  benchmarks, 
provides  a  highly  effective  soft  error  resilience  approach. 

3.  The  above  conclusions  about  cost-effective  soft  error  resilience 
techniques  largely  hold  across  various  application  characteristics  (e.g., 
latency  constraints  despite  errors  in  soft  real-time  applications). 

4.  One  must  address  the  challenge  of  potential  mismatch  between 
application  benchmarks  vs.  applications  in  the  field,  especially  when 
targeting  high  degrees  of  resilience.  We  overcome  this  challenge 
using  various  flavors  of  circuit-level  hardening  techniques  (Sec.  4). 

5.  Cost-effective  approaches  discussed  above  provide  bounds  that 
new  soft  error  resilience  techniques  must  achieve  to  be  competitive. 

2.  CLEAR  Framework 

Figure  1  gives  an  overview  of  the  CLEAR  framework. 

2.1  Reliability  Analysis 

We use flip-flop soft error injections for reliability analysis
(radiation test results confirm that injection of single bit-flips into
flip-flops closely models soft error behaviors in actual systems [Bottoni
14, Sanda 08]). Flip-flop-level error injection is crucial since naive
high-level error injections can be highly inaccurate [Cho 13].

We  injected  over  9  million  flip-flop  soft  errors  into  the  RTL  of  the 
processor  designs  using  three  BEE3  FPGA  emulation  systems  and 
also  using  mixed-mode  simulations  on  the  Stampede  supercomputer 
(TACC  at  The  University  of  Texas  at  Austin)  (similar  to  [Cho  13, 
Wang  04]).  This  ensures  that  error  injection  results  have  less  than  a 
0.1%  margin  of  error  with  a  95%  confidence  interval  per  benchmark. 
Errors are injected uniformly into all flip-flops and application
regions, to mimic real-world scenarios.

[Figure 1 panels: (a) Reliability analysis / execution time evaluation; (b) Physical design evaluation; (c) Resilience library; (d) Cross-layer evaluation (586 total combinations), plotting energy cost (%) against % SDC-causing errors protected.]

Resilience library shown in Figure 1(c):

Error detection / correction techniques
Alg. | (1) ABFT correction [Chen 05, Huang 84]
Alg. | (2) ABFT detection [Tao 93]
SW | (3) Assertions [Sahoo 08, Hari 12]
SW | (4) CFCSS [Oh 02a]
SW | (5) EDDI [Oh 02b]
Arch. | (6) DFC [Meixner 07]
Arch. | (7) Monitor core [Austin 99]
Logic | (8) Parity checking [Spainhower 99]
Circuit | (9) LEAP-DICE [Lee 10, Lilja 13]
Circuit | (10) EDS [Bowman 09, 11]

Recovery techniques
(1) Instruction Replay (IR) [Meaney 05]
(2) Extended IR (EIR)
(3) Flush [Racunas 07]
(4) Reorder Buffer (RoB) [Wang 05]

Figure 1. CLEAR Framework: (a) BEE3 emulation cluster / Stampede supercomputer injects over 9 million errors into two diverse processor
architectures running 18 full-length application benchmarks, (b) Accurate physical design evaluation accounts for resilience overheads, (c)
Comprehensive resilience library consisting of ten error detection / correction techniques + four hardware error recovery techniques, (d) Example
illustrating thorough exploration of 586 cross-layer combinations with varying energy costs vs. percentage of SDC-causing errors protected.
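The margin-of-error statement for the injection campaign can be sanity-checked with a short calculation. The sketch below assumes the standard normal approximation for a binomial proportion; the text does not state which estimator was used, and the function names `margin_of_error` and `injections_needed` are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(p: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation confidence interval
    for an outcome that occurs with probability p over n injections."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * math.sqrt(p * (1.0 - p) / n)


def injections_needed(margin: float, p: float = 0.5,
                      confidence: float = 0.95) -> int:
    """Number of injections needed to reach a target margin of error.
    p = 0.5 is the worst case; rarer outcomes need fewer injections."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


# Worst-case sample size for a 0.1% margin at 95% confidence.
worst_case = injections_needed(0.001)
# An outcome seen in 5% of runs (illustrative value) needs far fewer.
rare_case = injections_needed(0.001, p=0.05)
print(worst_case, rare_case)
```

Because the required sample size scales with p(1 - p), outcomes that occur rarely reach a given margin with fewer injections than the worst case.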

The  SPECINT2000  [Henning  00]  and  DARPA  PERFECT 
[DARPA] benchmark suites are used for evaluation¹. The PERFECT
suite  complements  SPEC  by  adding  applications  targeting  signal  and 
image  processing  domains.  We  ran  benchmarks  in  their  entirety. 

Flip-flop soft errors can result in the following outcomes [Cho 13,
Sanda 08, Wang 04]: Vanished - normal termination and output files
match error-free runs; Output Mismatch (OMM) - normal
termination, but output files are different from error-free runs;
Unexpected Termination (UT) - program terminates abnormally;
Hang - no termination or output within 2x the nominal execution
time; Error Detection (ED) - an employed resilience technique flags
an error, but the error is not recovered using hardware recovery.

Any  error  that  results  in  OMM  causes  SDC.  Any  error  that  results 
in  UT,  Hang,  or  ED  causes  DUE  (there  are  no  ED  outcomes  if  no  error 
detection  technique  is  employed).  The  resilience  of  a  protected  (new) 
design  compared  to  an  unprotected  (original,  baseline)  design  can  be 
defined in terms of SDC improvement (Eq. 1a) or DUE improvement
(Eq. 1b). Techniques that increase execution time or add flip-flops
increase the susceptibility of the design to soft errors. To accurately
account for this situation, we calculate, based on [Schirmeier 15], a
correction factor γ (where γ ≥ 1), which is applied to ensure a fair and
accurate comparison². Reporting SDC and DUE improvements allows
our  results  to  be  agnostic  to  absolute  error  rates. 


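\[
\text{SDC improvement} = \frac{\text{original OMM count}}{\text{new OMM count}} \times \gamma^{-1} \qquad \text{(Eq. 1a)}
\]

\[
\text{DUE improvement} = \frac{\text{original (UT + Hang) count}}{\text{new (UT + Hang + ED) count}} \times \gamma^{-1} \qquad \text{(Eq. 1b)}
\]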
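The improvement metrics can be illustrated with a small calculation over injection outcome counts. This is a minimal sketch, not the framework's implementation: it assumes the correction factor γ divides the raw count ratio, and the outcome counts below are invented purely for illustration.

```python
import math
from collections import Counter

# Outcome labels as defined in the text.
VANISHED, OMM, UT, HANG, ED = "Vanished", "OMM", "UT", "Hang", "ED"


def _ratio(original: int, new: int, gamma: float) -> float:
    """Raw improvement ratio scaled by 1/gamma (assumed use of gamma)."""
    if new == 0:
        return math.inf  # no remaining errors of this class
    return original / new / gamma


def sdc_improvement(original: Counter, new: Counter,
                    gamma: float = 1.0) -> float:
    """Only OMM outcomes count as SDC."""
    return _ratio(original[OMM], new[OMM], gamma)


def due_improvement(original: Counter, new: Counter,
                    gamma: float = 1.0) -> float:
    """UT and Hang count as DUE; ED is added for the protected design."""
    return _ratio(original[UT] + original[HANG],
                  new[UT] + new[HANG] + new[ED], gamma)


# Hypothetical counts: a detection-only technique converts most
# OMM/UT/Hang outcomes into ED outcomes.
baseline = Counter({VANISHED: 9000, OMM: 500, UT: 400, HANG: 100})
protected = Counter({VANISHED: 9480, OMM: 10, UT: 5, HANG: 5, ED: 500})

print(sdc_improvement(baseline, protected))            # 50.0
print(round(due_improvement(baseline, protected), 2))  # 0.98
```

In this invented example the SDC improvement is large while the DUE improvement falls below 1x, because every detected but unrecovered error is counted as a DUE.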


2.2  Execution  Time  Evaluation 

Execution  time  is  estimated  using  FPGA  emulation  and  RTL 
simulation.  Applications  are  run  to  completion.  Our  design 
methodology  maintains  clock  speed. 
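Because energy is the product of average power and execution time, the energy cost of a technique follows from its power cost and its execution time impact. The sketch below assumes both inputs are percentages relative to the unprotected design; the function name `energy_cost_pct` is hypothetical. The example values (1% power cost, 6.2% execution time impact) are those reported for DFC without recovery on the InO-core in Table 1.

```python
def energy_cost_pct(power_cost_pct: float,
                    exec_time_impact_pct: float) -> float:
    """Energy overhead (%) implied by a power overhead (%) and an
    execution time overhead (%), using energy = power x time."""
    power_ratio = 1.0 + power_cost_pct / 100.0
    time_ratio = 1.0 + exec_time_impact_pct / 100.0
    return (power_ratio * time_ratio - 1.0) * 100.0


# DFC without recovery on the InO-core: 1% power, 6.2% execution time.
dfc_energy = energy_cost_pct(1.0, 6.2)
print(round(dfc_energy, 1))  # 7.3

# With no execution time impact, energy cost equals power cost.
print(round(energy_cost_pct(22.4, 0.0), 1))  # 22.4
```

The second call shows why techniques with no execution time impact report identical power and energy costs.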

2.3  Physical  Design  Evaluation 

Synopsys design tools (Design Compiler, IC Compiler, PrimeTime)
along with a commercial 28nm technology library (with
corresponding SRAM compiler) are used to perform synthesis,
place-and-route, and power analysis. Synthesis and place-and-route (SP&R)
was run for all configurations of the design (before and after adding
resilience techniques) to ensure all constraints of the original design
(e.g., timing and DRC) were met for the resilient designs.

2.4  Resilience  Library 

We  carefully  chose  ten  error  detection  and  correction  and  four 
hardware  error  recovery  techniques.  These  techniques  largely  cover 
the  space  of  existing  soft  error  resilience  techniques.  The 
characteristics  of  each  technique  when  used  as  a  standalone  solution 
(e.g.,  an  error  detection  /  correction  technique  by  itself  or,  optionally, 
in conjunction with a recovery technique) are presented in Table 1.

Circuit: The hardened flip-flops (LEAP-DICE, Light Hardened
LEAP) in Table 2 are designed to tolerate SEUs and SEMUs [Lee 10,
Lilja 13]. Error Detection Sequential (EDS) [Bowman 09, 11] can be
used to detect flip-flop soft errors (in addition to timing errors).


Table 2. Resilient flip-flops.

Type | Soft Error Rate | Area | Power | Delay | Energy
Baseline | 1 | 1 | 1 | 1 | 1
Light Hardened LEAP (LHL) | 2.5x10^[exponent illegible] | 1.2 | 1.1 | 1.2 | 1.3
LEAP-DICE | 2.0x10^[exponent illegible] | 2.0 | 1.8 | 1 | 1.8
EDS⁸ | ~100% detect | 1.5 | 1.4 | 1 | 1.4

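Table 1. Individual resilience techniques: costs and improvements when implemented as a standalone solution.

Layer | Technique | Core | Area cost | Power cost | Energy cost | Exec. time impact | Avg. SDC improvement | Avg. DUE improvement | False positive
Circuit³ | LEAP-DICE (no additional recovery needed) | InO | 0-9.3% | 0-22.4% | 0-22.4% | 0% | 1x - 5,000x | 1x - 5,000x | 0%
Circuit³ | LEAP-DICE (no additional recovery needed) | OoO | 0-6.5% | 0-9.4% | 0-9.4% | 0% | 1x - 5,000x | 1x - 5,000x | 0%
Circuit³ | EDS (without recovery - unconstrained) | InO | 0-10.7% | 0-22.9% | 0-22.9% | 0% | 1x - 100,000x | 0.1x - 1x | 0%
Circuit³ | EDS (without recovery - unconstrained) | OoO | 0-12.2% | 0-11.5% | 0-11.5% | 0% | 1x - 100,000x | 0.1x - 1x | 0%
Circuit³ | EDS (with IR recovery) | InO | 0-16.7% | 0-43.9% | 0-43.9% | 0% | 1x - 100,000x | 1x - 100,000x | 0%
Circuit³ | EDS (with IR recovery) | OoO | 0-12.3% | 0-11.6% | 0-11.6% | 0% | 1x - 100,000x | 1x - 100,000x | 0%
Logic³ | Parity (without recovery - unconstrained) | InO | 0-10.9% | 0-23.1% | 0-23.1% | 0% | 1x - 100,000x | 0.1x - 1x | 0%
Logic³ | Parity (without recovery - unconstrained) | OoO | 0-14.1% | 0-13.6% | 0-13.6% | 0% | 1x - 100,000x | 0.1x - 1x | 0%
Logic³ | Parity (with IR recovery) | InO | 0-26.9% | 0-44% | 0-44% | 0% | 1x - 100,000x | 1x - 100,000x | 0%
Logic³ | Parity (with IR recovery) | OoO | 0-14.2% | 0-13.7% | 0-13.7% | 0% | 1x - 100,000x | 1x - 100,000x | 0%
Arch. | DFC (without recovery - unconstrained) | InO | 3% | 1% | 7.3% | 6.2% | 1.2x | 0.5x | 0%
Arch. | DFC (without recovery - unconstrained) | OoO | 0.2% | 0.1% | 7.2% | 7.1% | 1.2x | 0.5x | 0%
Arch. | DFC (with EIR recovery) | InO | 37% | 33% | 41.2% | 6.2% | 1.2x | 1.4x | 0%
Arch. | DFC (with EIR recovery) | OoO | 0.4% | 0.2% | 7.3% | 7.1% | 1.2x | 1.4x | 0%
Arch. | Monitor core (with RoB recovery) | OoO | 9% | 16.3% | 16.3% | 0% | 19x | 15x | 0%
Software⁴ | Software assertions for general-purpose processors (without recovery - unconstrained) | InO | 0% | 0% | 15.6% | 15.6%⁵ | 1.5x | 0.6x | 0.003%
Software⁴ | CFCSS (without recovery - unconstrained) | InO | 0% | 0% | 40.6% | 40.6% | [illegible] | [illegible] | 0%
Software⁴ | EDDI (without recovery - unconstrained) | InO | 0% | 0% | 110% | 110% | 37.8x⁶ | 0.3x | 0%
Alg. | ABFT correction (no additional recovery needed) | InO, OoO | 0% | 0% | 1.4% | 1.4% | 4.3x | 1.2x | 0%
Alg. | ABFT detection (without recovery - unconstrained) | InO, OoO | 0% | 0% | 24% | 1-56.9%⁷ | 3.5x | 0.5x | 0%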

1  11 SPEC / 7 PERFECT (InO), 8 SPEC / 3 PERFECT (OoO).

2  Research literature commonly considers γ=1. We use true γ values, but our
conclusions hold for γ=1 as well (latter is optimistic).

3  Circuit and logic techniques have tunable costs/resilience.

4  LLVM compiler no longer supports the Alpha architecture (OoO-core).

5  Some assertions (e.g., [Sahoo 08]) have false positives (i.e., error reported during
error-free run). Execution time impact reported discounts false positives.

6  EDDI with store-readback [Lin 14]. 3.3x SDC / 0.4x DUE improvement without.

7  Error detection checks may require computationally expensive calculations.

8  EDS costs for the flip-flop only. Error signal routing, delay buffers increase cost.
Logic: Parity checking detects errors by checking flip-flop inputs
and  outputs  [Spainhower  99].  Our  design  heuristics  ([Cheng  16b]) 
reduce  the  cost  of  parity  and  ensure  that  clock  frequency  is  maintained 
as  in  the  original  design.  SEMUs  are  minimized  through  layouts  that 
ensure  a  minimum  spacing  (the  size  of  one  flip-flop)  between  flip- 
flops  checked  by  the  same  parity  checker  [Amusan  09]. 
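The reason layout spacing matters for parity can be seen in a toy model. The sketch below assumes a single even-parity bit per group of flip-flops, computed over the values presented at the flip-flop inputs and compared against the parity of the latched outputs; the group size and bit values are illustrative, not taken from the designs in this paper.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """XOR of all bits in a parity group."""
    return reduce(xor, bits, 0)


def parity_mismatch(inputs, outputs):
    """True if the parity predicted from the flip-flop inputs differs
    from the parity of the values the flip-flops actually hold."""
    return parity(inputs) != parity(outputs)


captured = [1, 0, 1, 1]          # values presented at the inputs
single_upset = [1, 1, 1, 1]      # one flip-flop upset (SEU)
double_upset = [0, 1, 1, 1]      # two flip-flops in the same group upset

print(parity_mismatch(captured, captured))      # False: error-free
print(parity_mismatch(captured, single_upset))  # True: detected
print(parity_mismatch(captured, double_upset))  # False: escapes detection
```

An even number of upsets within one parity group cancels out, which is why flip-flops checked by the same parity checker are kept physically apart so that a single particle strike is unlikely to upset two of them.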

Architecture: Our implementation of Data Flow Checking (DFC)
includes Control Flow Checking (CFC), and resembles [Meixner 07].
Monitor cores are specialized checker cores that validate instructions
executed by the main core [Austin 99, Lu 82]. We analyze monitor
cores similar to [Austin 99] and confirmed (via IPC estimation) that
our monitor core implementation does not stall the main core.

Software: Software assertions for general-purpose processors
check program variables to ensure their values are valid. We combine
assertions from [Hari 12, Sahoo 08]. Control Flow Checking by
Software Signatures (CFCSS) [Oh 02a] and Error Detection by
Duplicated Instructions (EDDI) [Oh 02b] are implemented via
compiler modification. We utilize EDDI with store-readback [Lin 14].

Algorithm: Algorithm Based Fault Tolerance (ABFT) can detect
(ABFT detection) or detect and correct errors (ABFT correction)
through algorithm modifications [Chen 05, Huang 84, Tao 93].

Recovery:  We  consider  two  recovery  scenarios:  bounded  latency, 
i.e.,  an  error  must  be  recovered  within  a  fixed  period  of  time  after  its 
occurrence,  and  unconstrained,  i.e.,  where  no  latency  constraints  exist 
and  errors  are  recovered  externally  once  detected  (no  hardware 
recovery  is  required).  Bounded  latency  recovery  requires  one  of  the 
following  hardware  recovery  techniques  (Table  3):  flush  or  Reorder 
Buffer (RoB) recovery (flushing non-committed instructions followed
by re-execution) [Racunas 07, Wang 05]; instruction replay (IR) or
extended instruction replay (EIR) recovery (instruction checkpointing
to rollback and replay instructions) [Meaney 05]. EIR is an extension
of  IR  with  additional  buffers  required  by  DFC  for  recovery.  Flush  and 
RoB  are  unable  to  recover  from  errors  detected  after  the  memory  write 
stage  of  InO-cores  or  after  the  reorder  buffer  of  OoO-cores, 
respectively  (these  errors  will  have  propagated  to  architecture  visible 
states).  Hence,  LEAP-DICE  is  used  to  protect  flip-flops  in  these 
pipeline  stages  when  using  flush/RoB  recovery. 


Table 3. Hardware error recovery costs.

Core | Type | Area | Power | Energy | Recovery latency
InO | Instruction Replay (IR) recovery | 16% | 21% | 21% | 47 cycles
InO | EIR recovery | 34% | 32% | 32% | 47 cycles
InO | Flush recovery | 0.6% | 0.9% | 1.8% | 7 cycles
OoO | Instruction Replay (IR) recovery | 0.1% | 0.1% | 0.1% | 104 cycles
OoO | EIR recovery | 0.2% | 0.1% | 0.1% | 104 cycles
OoO | Reorder Buffer (RoB) recovery | 0.01% | 0.01% | 0.01% | 64 cycles

3.  Cross-Layer  Combinations 

CLEAR  uses  a  top-down  approach  to  explore  the  cost-effectiveness 
of  various  cross-layer  combinations.  For  example,  resilience 
techniques  at  upper  layers  of  the  system  stack  (e.g.,  ABFT  correction) 
are applied before moving down the stack to lower layers (e.g., an
optimized combination of logic parity checking, circuit-level LEAP-DICE,
and micro-architectural recovery). This approach (example
shown  in  Fig.  2)  ensures  that  resilience  techniques  from  various  layers 
of  the  stack  effectively  interact  with  one  another.  Resilience 
techniques  from  the  algorithm,  software,  and  architecture  layers  of  the 
stack  generally  protect  multiple  flip-flops  (determined  using  error 
injection);  however,  a  designer  typically  has  little  control  over  the 
specific  subset  protected.  Using  multiple  resilience  techniques  from 
these  layers  can  lead  to  situations  where  a  given  flip-flop  may  be 
protected  (sometimes  unnecessarily)  by  multiple  techniques.  At  the 
logic  and  circuit  layers,  fine-grained  protection  is  available  since  these 
techniques  can  be  applied  selectively  to  individual  flip-flops  (those 
not  sufficiently  protected  by  higher-level  techniques). 
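The final, fine-grained step of this top-down flow can be sketched as a greedy selection. This is only an illustration under simplifying assumptions, not the methodology of [Cheng 16b]: it assumes per-flip-flop SDC counts measured by error injection after the higher-layer techniques have been applied, treats a hardened flip-flop as contributing no further SDCs, and ignores the correction factor γ. The flip-flop names and counts are hypothetical.

```python
def select_flip_flops(sdc_counts: dict[str, int],
                      target_improvement: float) -> list[str]:
    """Pick flip-flops for circuit/logic-level protection, most
    vulnerable first, until the remaining SDC count is low enough
    that (total / remaining) meets the target improvement."""
    total = sum(sdc_counts.values())
    remaining = float(total)
    chosen = []
    ranked = sorted(sdc_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        # Stop once total / remaining >= target (written without division
        # so that remaining == 0 is handled safely).
        if remaining * target_improvement <= total:
            break
        chosen.append(name)
        remaining -= count
    return chosen


# Hypothetical SDC-causing injection counts per flip-flop, measured
# after a higher-layer technique has already been applied.
residual_sdc = {"pc": 60, "ir": 25, "alu_out": 10, "flag": 4, "dbg": 1}

print(select_flip_flops(residual_sdc, 5))   # ['pc', 'ir']
print(select_flip_flops(residual_sdc, 50))  # ['pc', 'ir', 'alu_out', 'flag']
```

Because a few flip-flops account for most SDCs in this invented data, modest targets need only a small protected set, while higher targets pull in progressively less vulnerable flip-flops.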


Figure  2.  Cross-layer  methodology  example  for  combining  ABFT 
correction,  LEAP-DICE,  logic  parity,  and  micro-architectural  recovery. 


Among  the  586  cross-layer  combinations  explored  using  CLEAR, 
a  highly  promising  approach  combines  selective  circuit-level 
hardening  using  LEAP-DICE,  logic  parity,  and  micro-architectural 
recovery  (flush  recovery  for  InO-cores,  RoB  recovery  for  OoO- 
cores).  Thorough  error  injection  using  application  benchmarks  is 
critical  in  selecting  flip-flops  protected  using  these  techniques. 
[Cheng  16b]  details  the  methodology  for  creating  this  combination. 

When  the  application  space  targets  specific  algorithms  (e.g.,  matrix 
operations), a cross-layer combination of LEAP-DICE, parity, ABFT
correction,  and  micro-architectural  error  recovery  (flush/RoB) 
provides  additional  energy  savings  (Table  4).  Since  ABFT  correction 
performs  in-place  error  correction,  no  separate  recovery  mechanism 
is  required  for  ABFT  correction.  When  targeting  DUE  improvement, 
ABFT correction provides no energy savings for the OoO-core. This
is because ABFT correction performs checks only at set locations in the
program, so an error can cause a DUE before any check runs (e.g., a DUE
resulting from an invalid pointer access can cause an immediate program
termination before a check is invoked).

Since  most  applications  are  not  amenable  to  ABFT  correction,  the 
flip-flops  protected  by  ABFT  correction  must  also  be  protected  by 
techniques  such  as  LEAP-DICE  or  parity  (or  combinations  thereof) 
for  processors  targeting  general-purpose  applications.  This  requires 
circuit  hardening  techniques  (e.g.,  [Mitra  05,  Zhang  06])  with  the 
ability  to  selectively  operate  in  an  error-resilient  mode  (high 
resilience,  high  energy)  when  ABFT  is  unavailable,  or  in  an  economy 
mode  (low  resilience,  low  power  mode)  when  ABFT  is  available. 
However,  the  overheads  outweigh  benefits  (details  in  [Cheng  16b]). 

Relative  benefits  seen  in  Table  4  are  consistent  across  benchmarks 
and  over  the  range  of  SDC/DUE  improvements.  Overheads  in  Table 
4  are  small  because  we  reported  the  most  energy-efficient  resilience 
solutions.  Most  of  the  586  combinations  are  far  costlier. 


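Table 4⁹. Costs vs. SDC and DUE improvements for efficient resilience techniques.

Each cell lists A / P / E: A (area cost %), P (power cost %), E (energy cost %). Columns 2x through max are improvement targets.

Core | Technique | Recovery scenario | Improvement | 2x | 5x | 50x | 500x | max | Exec. time impact
InO | Selective hardening using LEAP-DICE | Bounded latency | SDC | 0.8 / 2 / 2 | 1.8 / 4.3 / 4.3 | 2.9 / 7.3 / 7.3 | 3.3 / 8.2 / 8.2 | 9.3 / 22.4 / 22.4 | 0%
InO | Selective hardening using LEAP-DICE | Bounded latency | DUE | 0.7 / 1.5 / 1.5 | 1.7 / 3.8 / 3.8 | 3.8 / 9.5 / 9.5 | 5.1 / 12.5 / 12.5 | 9.3 / 22.4 / 22.4 | 0%
InO | Selective hardening using LEAP-DICE | Unconstrained¹⁰ | SDC | 0.8 / 2 / 2 | 1.8 / 4.3 / 4.3 | 2.9 / 7.3 / 7.3 | 3.3 / 8.2 / 8.2 | 9.3 / 22.4 / 22.4 | 0%
InO | Selective hardening using LEAP-DICE | Unconstrained¹⁰ | DUE | 0.7 / 1.5 / 1.5 | 1.7 / 3.8 / 3.8 | 3.8 / 9.5 / 9.5 | 5.1 / 12.5 / 12.5 | 9.3 / 22.4 / 22.4 | 0%
InO | LEAP-DICE + logic parity (+ flush recovery) | Bounded latency | SDC | 0.7 / 1.9 / 1.9 | 1.7 / 3.9 / 3.9 | 2.5 / 6.1 / 6.1 | 3 / 6.7 / 6.7 | 8 / 17.9 / 17.9 | 0%
InO | LEAP-DICE + logic parity (+ flush recovery) | Bounded latency | DUE | 0.6 / 1.5 / 1.5 | 1.5 / 3.4 / 3.4 | 3.6 / 8.4 / 8.4 | 4.4 / 10.4 / 10.4 | 8 / 17.9 / 17.9 | 0%
InO | LEAP-DICE + logic parity (+ flush recovery) | Unconstrained¹⁰ | SDC | 0.7 / 1.9 / 1.9 | 1.6 / 3.8 / 3.8 | 2.4 / 5.9 / 5.9 | 2.8 / 6.5 / 6.5 | 7.6 / 17.2 / 17.2 | 0%
InO | LEAP-DICE + logic parity (+ flush recovery) | Unconstrained¹⁰ | DUE | - | - | - | - | - | 0%
InO | ABFT correction + LEAP-DICE + logic parity (+ flush recovery) | Bounded latency | SDC | 0 / 0 / 1.4 | 0.4 / 0.7 / 2.2 | 1.0 / 1.7 / 3.1 | 1.2 / 1.8 / 3.2 | 8 / 17.9 / 19.6 | 1.4%
InO | ABFT correction + LEAP-DICE + logic parity (+ flush recovery) | Bounded latency | DUE | 0.3 / 1 / 2.4 | 0.4 / 1 / 2.4 | 1.5 / 3.3 / 4.8 | 2.7 / 5.7 / 7.2 | 8 / 17.9 / 19.6 | 1.4%
InO | ABFT correction + LEAP-DICE + logic parity (+ flush recovery) | Unconstrained¹⁰ | SDC | 0 / 0 / 1.4 | 0.4 / 0.7 / 2.2 | 0.9 / 1.6 / 3 | 1.1 / 1.8 / 3.2 | 7.6 / 17.2 / 18.8 | 1.4%
InO | ABFT correction + LEAP-DICE + logic parity (+ flush recovery) | Unconstrained¹⁰ | DUE | - | - | - | - | - | 1.4%
OoO | Selective hardening using LEAP-DICE | Bounded latency | SDC | 1.1 / 1.5 / 1.5 | 1.3 / 1.7 / 1.7 | 2.2 / 3.1 / 3.1 | 2.4 / 3.5 / 3.5 | 6.5 / 9.4 / 9.4 | 0%
OoO | Selective hardening using LEAP-DICE | Bounded latency | DUE | 1.3 / 2 / 2 | 1.6 / 2.3 / 2.3 | 3.1 / 4.2 / 4.2 | 3.6 / 5.1 / 5.1 | 6.5 / 9.4 / 9.4 | 0%
OoO | Selective hardening using LEAP-DICE | Unconstrained¹⁰ | SDC | 1.1 / 1.5 / 1.5 | 1.3 / 1.7 / 1.7 | 2.2 / 3.1 / 3.1 | 2.4 / 3.5 / 3.5 | 6.5 / 9.4 / 9.4 | 0%
OoO | Selective hardening using LEAP-DICE | Unconstrained¹⁰ | DUE | 1.3 / 2 / 2 | 1.6 / 2.3 / 2.3 | 3.1 / 4.2 / 4.2 | 3.6 / 5.1 / 5.1 | 6.5 / 9.4 / 9.4 | 0%
OoO | LEAP-DICE + logic parity (+ RoB recovery) | Bounded latency | SDC | [illegible] / 0.1 / 0.1 | 0.1 / 0.2 / 0.2 | 1.4 / 2.1 / 2.1 | 2.2 / 2.4 / 2.4 | 4.9 / 7 / 7 | 0%
OoO | LEAP-DICE + logic parity (+ RoB recovery) | Bounded latency | DUE | 0.5 / 0.1 / 0.1 | 0.7 / 0.1 / 0.1 | 2.6 / 2 / 2 | 3 / 1.8 / 1.8 | 4.9 / 7 / 7 | 0%
OoO | LEAP-DICE + logic parity (+ RoB recovery) | Unconstrained¹⁰ | SDC | [illegible] / 0.1 / 0.1 | 0.1 / 0.2 / 0.2 | 1.4 / 2.1 / 2.1 | 2.2 / 2.4 / 2.4 | 4.9 / 7 / 7 | 0%
OoO | LEAP-DICE + logic parity (+ RoB recovery) | Unconstrained¹⁰ | DUE | - | - | - | - | - | 0%
OoO | ABFT correction + LEAP-DICE + logic parity (+ RoB recovery) | Bounded latency | SDC | 0 / 0 / 1.4 | [illegible] / 0.01 / 1.5 | 0.3 / 0.5 / 1.9 | 0.5 / 0.8 / 2.2 | 4.9 / 7 / 8.5 | 1.4%
OoO | ABFT correction + LEAP-DICE + logic parity (+ RoB recovery) | Bounded latency | DUE | 0.4 / 0.1 / 1.5 | 0.6 / 0.1 / 1.5 | 2.1 / 3 / 4.2 | 3 / 1.6 / 3 | 4.9 / 7 / 8.5 | 1.4%
OoO | ABFT correction + LEAP-DICE + logic parity (+ RoB recovery) | Unconstrained¹⁰ | SDC | 0 / 0 / 1.4 | [illegible] / 0.01 / 1.5 | 0.3 / 0.5 / 1.9 | 0.5 / 0.8 / 2.2 | 4.8 / 6.9 / 8.4 | 1.4%
OoO | ABFT correction + LEAP-DICE + logic parity (+ RoB recovery) | Unconstrained¹⁰ | DUE | - | - | - | - | - | 1.4%

9  Costs generated per benchmark and averaged. Relative std. deviation: 0.6-3.1%.

10  DUE improvements not possible when detection-only techniques are used in an
unconstrained recovery scenario.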
4.  Application  Benchmark  Dependence

The  most  cost-effective  resilience  techniques  are  guided  by  error 
injection  using  application  benchmarks.  What  happens  when  the 
applications  in  the  field  do  not  match  application  benchmarks?  We 
refer  to  this  situation  as  application  benchmark  dependence.  To 
quantify  this  dependence,  we  randomly  selected  4  (of  11)  SPEC 
benchmarks  as  a  training  set,  and  used  the  remaining  7  as  a  validation 
set.  Resilience  is  implemented  using  the  training  set  and  the  resulting 
design’s  resilience  is  determined  using  the  validation  set.  We  used  50 
training/validation  pairs.  Table  5  indicates  that  validated  SDC 
improvement  is  generally  underestimated.  Fortunately,  when 
targeting  <10x  SDC  improvement,  the  underestimation  is  <4%.  This 
is  due  to  the  fact  that  the  most  vulnerable  10%  of  flip-flops  (i.e.,  the 
flip-flops  that  result  in  the  most  SDCs  or  DUEs)  are  consistent  across 
benchmarks.  Benchmark  sensitivity  may  be  minimized  by  training 
using  additional  benchmarks  or  through  better  benchmarks  (e.g., 
[Mirkhani  15]).  An  alternative  approach  is  to  apply  our  CLEAR 
framework  using  available  benchmarks,  and  then  replace  all 
remaining  unprotected  flip-flops  using  LHL  (Table  2).  This  enables 
our  resilient  designs  to  meet  (or  exceed)  resilience  targets  at  <1.2% 
additional  cost  for  SDC  improvements  >10x. 
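The training/validation procedure described above can be sketched as follows. This is a minimal illustration assuming independent uniform random draws for each pair (the text does not say whether the 50 pairs were required to be distinct); the benchmark names are placeholders, and `underestimation_pct` is a hypothetical helper that expresses the shortfall of the validated improvement relative to the training target.

```python
import random


def make_splits(benchmarks, train_size=4, num_pairs=50, seed=0):
    """Draw training/validation pairs: train_size benchmarks for
    training, the rest for validation."""
    rng = random.Random(seed)
    splits = []
    for _ in range(num_pairs):
        train = rng.sample(benchmarks, train_size)
        validate = [b for b in benchmarks if b not in train]
        splits.append((train, validate))
    return splits


def underestimation_pct(target: float, validated: float) -> float:
    """Shortfall of the validated improvement relative to the target."""
    return (target - validated) / target * 100.0


# Placeholder names standing in for the 11 SPEC benchmarks.
spec = [f"spec_bench_{i:02d}" for i in range(11)]
splits = make_splits(spec)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 50 4 7

# InO-core row of Table 5: 50x targeted in training, 38.9x validated.
print(round(underestimation_pct(50.0, 38.9), 1))  # 22.2
```

The last line uses the InO-core 50x row of Table 5 to show how the gap between trained and validated improvement grows at higher resilience targets.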


Table 5. SDC improvement, cost before and after applying LHL to otherwise unprotected flip-flops.

Core | SDC improvement: Train | SDC improvement: Validate | SDC improvement: After LHL | Cost before LHL insertion: Area | Cost before LHL insertion: Power / Energy | Cost after LHL insertion: Area | Cost after LHL insertion: Power / Energy
InO | 5x | 4.8x | 19.3x | 1.6% | 3.6% | 3.1% | 5.7%
InO | 50x | 38.9x | 152.3x | 2.4% | [illegible] | 3.3% | 6.9%
InO | 500x | 433.1x | 1,326.1x | 2.9% | 6.3% | 3.4% | 7.1%
InO | Max | 5,568.9x | 5,568.9x | 8% | 17.9% | 8% | 17.9%
OoO | 5x | 4.8x | 35.1x | 0.1% | 0.2% | 0.9% | 1.8%
OoO | 50x | 32.1x | 204.3x | 1.4% | 2.1% | 1.9% | 2.7%
OoO | 500x | 301.4x | 1,084.1x | 2.2% | 2.4% | 2.4% | 2.8%
OoO | Max | 6,625.8x | 6,625.8x | 4.9% | 7% | 4.9% | 7%

5.  Conclusions 

CLEAR  is  a  first  of  its  kind  cross-layer  resilience  framework  that 
enables  effective  exploration  of  a  wide  variety  of  resilience 
techniques  and  their  combinations  across  several  layers  of  the  system 
stack.  Our  extensive  cross-layer  resilience  studies  demonstrate: 

1.  A  carefully  optimized  combination  of  selective  circuit-level 
hardening,  logic-level  parity  checking,  and  micro-architectural 
recovery  provides  a  highly  cost-effective  soft  error  resilience  solution 
for  general-purpose  processors. 

2.  Selective  circuit-level  hardening  alone,  guided  by  thorough 
analysis  of  the  effects  of  soft  errors  on  application  benchmarks,  also 
provides  a  cost-effective  soft  error  resilience  solution  (with  ~1% 
additional  energy  cost  for  the  same  50x  SDC  improvement). 

3.  Algorithm  Based  Fault  Tolerance  (ABFT)  correction  combined 
with  selective  circuit-level  hardening,  logic-level  parity  checking,  and 
micro-architectural  recovery  can  further  improve  soft  error  resilience 
costs. However, existing ABFT correction techniques can only be
used for a few applications, limiting the applicability of this approach.

4.  We  can  derive  bounds  on  energy  costs  vs.  degree  of  resilience 
(SDC  or  DUE  improvements)  that  new  soft  error  resilience  techniques 
must  achieve  to  be  competitive  (shown  in  Fig.  3). 

5.  It  is  crucial  that  the  benefits  and  costs  of  new  resilience 
techniques  are  evaluated  thoroughly  and  correctly.  Detailed  analysis 
(e.g.,  flip-flop-level  error  injection  or  layout-level  cost  quantification) 
identifies  hidden  weaknesses  that  are  often  overlooked. 


Figure  3.  New  resilience  techniques  must  have  cost  and  improvement 
tradeoffs  that  lie  within  the  shaded  regions  bounded  by  LEAP-DICE  + 
parity  +  micro-architectural  recovery. 